default policy
Goals and the Structure of Experience
Amir, Nadav, Tiomkin, Stas, Langdon, Angela
Purposeful behavior is a hallmark of natural and artificial intelligence. Its acquisition is often believed to rely on world models, comprising both descriptive (what is) and prescriptive (what is desirable) aspects that identify and evaluate states of affairs in the world, respectively. Canonical computational accounts of purposeful behavior, such as reinforcement learning, posit distinct components of a world model comprising a state representation (descriptive aspect) and a reward function (prescriptive aspect). However, an alternative possibility, which has not yet been computationally formulated, is that these two aspects instead co-emerge interdependently from an agent's goal. Here, we describe a computational framework of goal-directed state representation in cognitive agents, in which the descriptive and prescriptive aspects of a world model co-emerge from agent-environment interaction sequences, or experiences. Drawing on Buddhist epistemology, we introduce a construct of goal-directed, or telic, states, defined as classes of goal-equivalent experience distributions. Telic states provide a parsimonious account of goal-directed learning in terms of the statistical divergence between behavioral policies and desirable experience features. We review empirical and theoretical literature supporting this novel perspective and discuss its potential to provide a unified account of the behavioral, phenomenological and neural dimensions of purposeful behavior across diverse substrates.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Texas (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Leisure & Entertainment (0.67)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.67)
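To make the telic-state construct concrete, here is a minimal Python sketch (my illustration, not the authors' formalism): policies induce experience distributions, and those whose statistical divergence from a desired distribution falls in the same band are grouped into one goal-equivalent, or telic, state. The distributions and the banding rule are invented for illustration.

```python
# Minimal sketch (not the authors' code) of the telic-state idea:
# policies are grouped into goal-equivalence classes by how close the
# experience distributions they induce are to a desired distribution.
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical experience distributions over 4 discrete experience types.
goal = [0.7, 0.1, 0.1, 0.1]          # desirable experience statistics
policies = {
    "forage":  [0.65, 0.15, 0.10, 0.10],
    "wander":  [0.25, 0.25, 0.25, 0.25],
    "exploit": [0.68, 0.12, 0.10, 0.10],
}

# Two policies share a telic state if their divergences from the goal
# fall in the same band (a crude stand-in for goal-equivalence).
def telic_state(policy_dist, band=0.05):
    return round(kl(policy_dist, goal) / band)

for name, dist in policies.items():
    print(name, "KL:", round(kl(dist, goal), 4), "telic state:", telic_state(dist))
```

Under this toy banding, "forage" and "exploit" land in the same telic state while "wander" does not, which is the sense in which goal-equivalent experience distributions collapse into one state.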
Robust Humanoid Walking on Compliant and Uneven Terrain with Deep Reinforcement Learning
Singh, Rohan P., Morisawa, Mitsuharu, Benallegue, Mehdi, Xie, Zhaoming, Kanehiro, Fumio
For the deployment of legged robots in real-world environments, it is essential to develop robust locomotion control methods for challenging terrains that may exhibit unexpected deformability and irregularity. In this paper, we explore the application of sim-to-real deep reinforcement learning (RL) for the design of bipedal locomotion controllers for humanoid robots on compliant and uneven terrains. Our key contribution is to show that a simple training curriculum exposing the RL agent to randomized terrains in simulation can achieve robust walking on a real humanoid robot using only proprioceptive feedback. We train an end-to-end bipedal locomotion policy using the proposed approach, and show extensive real-robot demonstrations on the HRP-5P humanoid over several difficult terrains inside and outside the lab environment. Further, we argue that the robustness of a bipedal walking policy can be improved if the robot is allowed to exhibit aperiodic motion with variable stepping frequency. We propose a new control policy that enables modification of the observed clock signal, leading to adaptive gait frequencies depending on the terrain and command velocity. Through simulation experiments, we show the effectiveness of this policy specifically for walking over challenging terrains by controlling swing and stance durations.
- North America > United States (0.14)
- Asia > Japan > Honshū > Kantō > Ibaraki Prefecture > Tsukuba (0.04)
- Instructional Material > Course Syllabus & Notes (0.54)
- Research Report (0.50)
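The adaptive clock idea lends itself to a small sketch. The following snippet (my construction; the paper's actual observation space, frequencies, and policy interface may differ) shows a phase clock whose rate a policy output can modulate, producing the sin/cos clock observation and hence aperiodic stepping.

```python
# Minimal sketch (my illustration, not the paper's controller) of an
# adaptive gait clock: the policy observes sin/cos of a phase variable
# and can modulate the phase rate, yielding variable stepping frequency.
import math

class AdaptiveClock:
    def __init__(self, base_freq_hz=1.2, dt=0.002):
        self.phase = 0.0
        self.base_freq = base_freq_hz  # nominal stepping frequency (assumed)
        self.dt = dt                   # control timestep (assumed)

    def step(self, freq_offset):
        """Advance the clock; freq_offset is a policy output in [-0.5, 0.5]."""
        freq = max(0.1, self.base_freq * (1.0 + freq_offset))
        self.phase = (self.phase + 2 * math.pi * freq * self.dt) % (2 * math.pi)
        # Clock observation fed to the policy alongside proprioception.
        return math.sin(self.phase), math.cos(self.phase)

clock = AdaptiveClock()
for t in range(5):
    # A real policy would output freq_offset from terrain feedback; here
    # we simply slow the gait as if the terrain were compliant.
    obs = clock.step(freq_offset=-0.3)
    print(t, [round(x, 3) for x in obs])
```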
Context-Generative Default Policy for Bounded Rational Agent
Pushp, Durgakant, Xu, Junhong, Chen, Zheng, Liu, Lantao
Bounded rational agents often make decisions by evaluating a finite selection of choices, typically derived from a reference point termed the 'default policy,' based on previous experience. However, the inherent rigidity of a static default policy presents significant challenges for agents operating in unknown environments that are not included in the agent's prior knowledge. In this work, we introduce a context-generative default policy that leverages the region observed by the robot to predict the unobserved part of the environment, thereby enabling the robot to adaptively adjust its default policy based on both the actual observed map and the imagined unobserved map. Furthermore, the adaptive nature of the bounded rationality framework enables the robot to manage unreliable or incorrect imaginations by selectively sampling a few trajectories in the vicinity of the default policy. Our approach utilizes a diffusion model for map prediction and sampling-based planning with B-spline trajectory optimization to generate the default policy. Extensive evaluations reveal that the context-generative policy outperforms the baseline methods in identifying and avoiding unseen obstacles. Additionally, real-world experiments conducted with Crazyflie drones demonstrate the adaptability of our proposed method, even when acting in environments outside the domain of the training distribution.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > New York (0.04)
- North America > United States > Indiana > Monroe County > Bloomington (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
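A rough sketch of the bounded-rational sampling step, under assumptions of mine: the default policy is stood in for by a straight-line trajectory (rather than the paper's B-spline optimization over a diffusion-predicted map), and only a handful of perturbed candidates near it are evaluated against the imagined obstacles.

```python
# Rough sketch (assumptions mine, not the paper's planner): perturb a
# nominal trajectory a few times and keep the best candidate under a
# cost that accounts for obstacles imagined by a map-prediction model.
import numpy as np

rng = np.random.default_rng(0)

def cost(traj, obstacle_centers, goal):
    """Distance-to-goal plus a penalty for passing near obstacles."""
    d_goal = np.linalg.norm(traj[-1] - goal)
    d_obs = min(np.linalg.norm(traj[:, None, :] - obstacle_centers[None],
                               axis=2).min(), 1.0)
    return d_goal + 2.0 * (1.0 - d_obs)

def perturb(traj):
    noise = rng.normal(0.0, 0.3, traj.shape)
    noise[0] = 0.0  # keep the trajectory's start fixed at the robot pose
    return traj + noise

# Default policy stand-in: straight-line waypoints to the goal.
goal = np.array([5.0, 0.0])
default = np.linspace([0.0, 0.0], goal, 10)
imagined_obstacles = np.array([[2.5, 0.2]])  # hypothetical prediction

# Bounded rationality: evaluate only a few candidates near the default.
candidates = [default] + [perturb(default) for _ in range(8)]
best = min(candidates, key=lambda t: cost(t, imagined_obstacles, goal))
print("best terminal point:", best[-1].round(2))
```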
Learning telic-controllable state representations
Amir, Nadav, Tiomkin, Stas, Langdon, Angela
Computational descriptions of purposeful behavior comprise both descriptive and normative aspects. The former are used to ascertain current (or future) states of the world and the latter to evaluate the desirability, or lack thereof, of these states under some goal. In Reinforcement Learning, the normative aspect (reward and value functions) is assumed to depend on a predefined and fixed descriptive one (state representation). Alternatively, these two aspects may emerge interdependently: goals can be, and indeed often are, approximated by state-dependent reward functions, but they may also shape the acquired state representations themselves. Here, we present a novel computational framework for state representation learning in bounded agents, where descriptive and normative aspects are coupled through the notion of goal-directed, or telic, states. We introduce the concept of telic controllability to characterize the tradeoff between the granularity of a telic state representation and the policy complexity required to reach all telic states. We propose an algorithm for learning controllable state representations, illustrating it using a simple navigation task with shifting goals. Our framework highlights the crucial role of deliberate ignorance -- knowing which features of experience to ignore -- for learning state representations that balance goal flexibility and policy complexity. More broadly, our work advances a unified theoretical perspective on goal-directed state representation learning in natural and artificial agents.
- North America > United States > Montana (0.04)
- Asia > Middle East > Israel (0.04)
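The "deliberate ignorance" idea can be illustrated with a toy abstraction (mine, not the paper's algorithm): experiences that differ only in goal-irrelevant features collapse into a single telic state, shrinking the representation until the goal shifts and re-includes the ignored feature.

```python
# Toy sketch (my illustration, not the paper's algorithm) of deliberate
# ignorance: collapse experiences that differ only in goal-irrelevant
# features into one telic state, shrinking the representation.
from itertools import product

# Experiences: (x, y, wall_color) in a tiny grid world; for a purely
# positional navigation goal, color is deliberately ignored.
experiences = list(product(range(3), range(3), ["red", "blue"]))

def telic_abstraction(exp, goal_relevant=("x", "y")):
    x, y, color = exp
    features = {"x": x, "y": y, "color": color}
    return tuple(features[f] for f in goal_relevant)

telic_states = {telic_abstraction(e) for e in experiences}
print(len(experiences), "experiences ->", len(telic_states), "telic states")
# A shifted goal (e.g. "reach a red wall") would pass
# goal_relevant=("x", "y", "color"), refining the representation and
# raising the policy complexity needed to reach every telic state.
```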
Matchings, Predictions and Counterfactual Harm in Refugee Resettlement Processes
Lee, Seungeon, Benz, Nina Corvelo, Thejaswi, Suhas, Gomez-Rodriguez, Manuel
Resettlement agencies have started to adopt data-driven algorithmic matching to match refugees to locations using employment rate as a measure of utility. Given a pool of refugees, data-driven algorithmic matching utilizes a classifier to predict the probability that each refugee would find employment at any given location. Then, it uses the predicted probabilities to estimate the expected utility of all possible placement decisions. Finally, it finds the placement decisions that maximize the predicted utility by solving a maximum weight bipartite matching problem. In this work, we argue that, using existing solutions, there may be pools of refugees for which data-driven algorithmic matching is (counterfactually) harmful -- it would have achieved lower utility than a given default policy used in the past, had it been used. Then, we develop a post-processing algorithm that, given placement decisions made by a default policy on a pool of refugees and their employment outcomes, solves an inverse matching problem to minimally modify the predictions made by a given classifier. Under these modified predictions, the optimal matching policy that maximizes predicted utility on the pool is guaranteed to be not harmful. Further, we introduce a Transformer model that, given placement decisions made by a default policy on multiple pools of refugees and their employment outcomes, learns to modify the predictions made by a classifier so that the optimal matching policy that maximizes predicted utility under the modified predictions on an unseen pool of refugees is less likely to be harmful than under the original predictions. Experiments on simulated resettlement processes using synthetic refugee data created from a variety of publicly available data suggest that our methodology may be effective in making algorithmic placement decisions that are less likely to be harmful than existing solutions.
- North America > United States > Virginia (0.04)
- North America > United States > Massachusetts (0.04)
- North America > United States > California (0.04)
- (13 more...)
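For the matching step alone, a short sketch with hypothetical numbers (not the paper's pipeline or data): predicted employment probabilities become edge weights in a maximum-weight bipartite matching, solved here with SciPy's assignment routine.

```python
# Sketch of the matching step only (illustrative numbers, not the
# paper's pipeline): predicted employment probabilities become edge
# weights in a maximum-weight bipartite matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

# p[i, j]: classifier's predicted probability that refugee i finds
# employment at location j (hypothetical values).
p = np.array([
    [0.30, 0.55, 0.20],
    [0.60, 0.40, 0.35],
    [0.25, 0.45, 0.50],
])

# linear_sum_assignment minimizes cost, so negate to maximize utility.
rows, cols = linear_sum_assignment(-p)
print("placements:", list(zip(rows, cols)),
      "expected utility:", p[rows, cols].sum())
# The paper's post-processing would minimally modify p so the resulting
# matching is guaranteed not to underperform the default policy.
```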
Do No Harm: A Counterfactual Approach to Safe Reinforcement Learning
Vaskov, Sean, Schwarting, Wilko, Baker, Chris L.
Reinforcement Learning (RL) for control has become increasingly popular due to its ability to learn rich feedback policies that take into account uncertainty and complex representations of the environment. When considering safety constraints, constrained optimization approaches, where agents are penalized for constraint violations, are commonly used. In such methods, if agents are initialized in, or must visit, states where constraint violation might be inevitable, it is unclear how much they should be penalized. We address this challenge by formulating a constraint on the counterfactual harm of the learned policy compared to a default, safe policy. In a philosophical sense this formulation only penalizes the learner for constraint violations that it caused; in a practical sense it maintains feasibility of the optimal control problem. We present simulation studies on a rover with uncertain road friction and a tractor-trailer parking environment that demonstrate our constraint formulation enables agents to learn safer policies than contemporary constrained RL methods.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.86)
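The counterfactual constraint admits a compact schematic (my simplification of the idea, not the paper's formulation): only violations in excess of what the default safe policy would have incurred are penalized, which keeps the objective well-posed in states where some violation is inevitable.

```python
# Schematic sketch (my simplification, not the paper's formulation) of
# the counterfactual-harm constraint: penalize the learner only for
# violations beyond those the default safe policy would also incur.
def counterfactual_harm(violations_learned, violations_default):
    """Harm = excess constraint violations relative to the default policy."""
    return max(0.0, violations_learned - violations_default)

def constrained_objective(reward, violations_learned, violations_default,
                          lam=10.0):
    # Standard constrained RL subtracts lam * violations_learned, which
    # is ill-posed when violation is inevitable; the counterfactual form
    # stays feasible in those states.
    return reward - lam * counterfactual_harm(violations_learned,
                                              violations_default)

# Inevitable violation (default also violates): no penalty.
print(constrained_objective(5.0, violations_learned=1.0,
                            violations_default=1.0))   # 5.0
# Violation the learner caused: full penalty.
print(constrained_objective(5.0, violations_learned=1.0,
                            violations_default=0.0))   # -5.0
```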
DESPOT: Online POMDP Planning with Regularization
POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the "curse of dimensionality" and the "curse of history". This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios.
- Asia > Singapore (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
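A toy sketch of the scenario idea (not the DESPOT implementation): fix a small set of random scenarios, each a deterministic random stream, and compare candidate plans only under those scenarios, so search cost scales with the sample size rather than with the full belief.

```python
# Toy sketch (not the DESPOT implementation) of scenario sampling: fix
# K random scenarios up front and evaluate candidate plans only on
# those determinized streams, keeping the search tree small.
import random

def simulate(plan, scenario_seed, start_state=0):
    """Deterministic rollout: one scenario = one fixed random stream."""
    rng = random.Random(scenario_seed)
    state, total = start_state, 0.0
    for action in plan:
        state += action + rng.choice([-1, 0, 1])  # stochastic transition
        total += -abs(state - 5)                  # reward: stay near 5
    return total

K = 20
scenarios = list(range(K))  # K fixed scenario seeds, sampled once
plans = [(1, 1, 1), (2, 2, 1), (0, 2, 2)]
best = max(plans, key=lambda pl: sum(simulate(pl, s) for s in scenarios) / K)
print("best plan under sampled scenarios:", best)
```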
Coordination of Bounded Rational Drones through Informed Prior Policy
Pushp, Durgakant, Xu, Junhong, Liu, Lantao
Biological agents, such as humans and animals, are capable of making decisions out of a very large number of choices in a limited time. They can do so because they use their prior knowledge to find a solution that is not necessarily optimal but good enough for the given task. In this work, we study the motion coordination of multiple drones under the above-mentioned paradigm, Bounded Rationality (BR), to achieve cooperative motion planning tasks. Specifically, we design a prior policy that provides useful goal-directed navigation heuristics in familiar environments and is adaptive in unfamiliar ones via Reinforcement Learning augmented with an environment-dependent exploration noise. Integrating this prior policy into the game-theoretic bounded rationality framework allows agents to quickly make decisions in a group while considering other agents' computational constraints. Our investigation shows that agents with a well-informed prior policy increase the efficiency of the collective decision-making capability of the group. We have conducted rigorous experiments in simulation and in the real world to demonstrate that the ability of informed agents to navigate to the goal safely can guide the group to coordinate efficiently under the BR framework.
- North America > United States > New York (0.04)
- North America > United States > Michigan > Wayne County > Detroit (0.04)
- North America > United States > Indiana > Monroe County > Bloomington (0.04)
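The standard bounded-rationality decision rule behind this kind of work can be sketched in a few lines (my formulation of the generic rule, not the paper's full game-theoretic solver): the agent reweights an informed prior policy by utility, with an inverse-temperature parameter bounding how far deliberation can move it from the prior.

```python
# Minimal sketch (the generic BR decision rule, not the paper's full
# solver): the agent reweights an informed prior policy by utility,
# with beta bounding the amount of deliberation.
import numpy as np

def bounded_rational_policy(prior, utility, beta):
    """p(a) proportional to prior(a) * exp(beta * U(a))."""
    w = prior * np.exp(beta * (utility - utility.max()))
    return w / w.sum()

prior = np.array([0.6, 0.3, 0.1])     # goal-directed navigation heuristic
utility = np.array([1.0, 2.0, 0.5])   # hypothetical maneuver utilities

for beta in (0.0, 1.0, 10.0):  # 0: follow the prior; large: near-optimal
    print(beta, bounded_rational_policy(prior, utility, beta).round(3))
```

A well-informed prior makes low-beta (cheap) decisions already good, which is the sense in which informed agents raise the group's collective decision-making efficiency.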
Minimum Description Length Control
Moskovitz, Ted, Kao, Ta-Chu, Sahani, Maneesh, Botvinick, Matthew M.
In order to learn efficiently in a complex world with multiple, sometimes rapidly changing objectives, both animals and machines must leverage information obtained from past experience. This is a challenging task, as processing and storing all relevant information is computationally infeasible. How can an intelligent agent address this problem? We hypothesize that one route may lie in the dual process theory of cognition, a longstanding framework in cognitive psychology first introduced by William James (James, 1890) which lies at the heart of many dichotomies in both cognitive science and machine learning. Examples include goal-directed versus habitual behavior (Graybiel, 2008), model-based versus model-free reinforcement learning (Daw et al., 2011; Sutton and Barto, 2018), and "System 1" versus "System 2" thinking (Kahneman, 2011).
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (3 more...)
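A rough sketch of the compression intuition (simplified; the paper develops a variational scheme over parametric policies): a default policy is distilled toward the mixture of past task policies, and each task policy's description length relative to that default is the KL term that regularizes new learning.

```python
# Rough sketch (simplified from the paper's variational scheme): distill
# a default policy toward the mixture of past task policies, and treat
# KL(task || default) as the description length of each task policy.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

task_policies = [np.array([0.8, 0.1, 0.1]),
                 np.array([0.1, 0.8, 0.1]),
                 np.array([0.7, 0.2, 0.1])]

# Compression step: the default minimizing the summed KLs is the mean.
default = np.mean(task_policies, axis=0)

# Description length of each task policy relative to the default;
# regularizing new policies by this KL keeps them cheap to encode.
for i, pi in enumerate(task_policies):
    print(f"task {i}: KL(pi || default) = {kl(pi, default):.3f}")
```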
A Fast Evolutionary adaptation for MCTS in Pommerman
Panwar, Harsh, Chatterjee, Saswata, Dube, Wil
Artificial Intelligence, when amalgamated with games, makes an ideal structure for research and for advancing the field. Multi-agent games have multiple controls for each agent, which generates huge amounts of data while increasing search complexity. Thus, we need advanced search methods to find a solution and create an artificially intelligent agent. In this paper, we propose our novel Fast Evolutionary Monte Carlo Tree Search (FEMCTS) agent, which borrows ideas from Evolutionary Algorithms (EA) and Monte Carlo Tree Search (MCTS) to play the game of Pommerman. It outperforms Rolling Horizon Evolutionary Algorithm (RHEA) significantly in high-observability settings and performs almost as well as MCTS for most game seeds, outperforming it in some cases.
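To give a flavor of EA-plus-MCTS hybrids of this kind (a toy guess at the mechanics, not the FEMCTS agent itself): evolve short action sequences whose fitness is their simulated return, and use the fittest sequence as a rollout policy inside the tree search.

```python
# Toy sketch (illustrative of EA + MCTS hybrids generally, not the
# FEMCTS agent): evolve short action sequences by simulated return and
# use the fittest as the rollout policy inside MCTS simulations.
import random

ACTIONS, HORIZON, POP = list(range(4)), 6, 10
rng = random.Random(0)

def rollout_return(seq):
    """Stand-in simulator: reward favors alternating actions."""
    return sum(1.0 for a, b in zip(seq, seq[1:]) if a != b)

population = [[rng.choice(ACTIONS) for _ in range(HORIZON)]
              for _ in range(POP)]
for gen in range(20):
    population.sort(key=rollout_return, reverse=True)
    elite = population[: POP // 2]
    # Refill the population by one-point mutation of the elites.
    children = []
    for parent in elite:
        child = parent[:]
        child[rng.randrange(HORIZON)] = rng.choice(ACTIONS)
        children.append(child)
    population = elite + children

best = max(population, key=rollout_return)
print("best rollout sequence:", best, "return:", rollout_return(best))
```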